NSF PAR Search | NSF Public Access Repository

Fast Instruction Selection for Fast Digital Signal Processing

https://doi.org/10.1145/3623278.3624768

Root, Alexander J; Ahmad, Maaz Bin Safeer; Sharlet, Dillon; Adams, Andrew; Kamil, Shoaib; Ragan-Kelley, Jonathan (March 2023, ACM)

Modern vector processors support a wide variety of instructions for fixed-point digital signal processing. These instructions support a proliferation of rounding, saturating, and type conversion modes, and are often fused combinations of more primitive operations. While these are common idioms in fixed-point signal processing, it is difficult to use these operations in portable code. It is challenging for programmers to write down portable integer arithmetic in a C-like language that corresponds exactly to one of these instructions, and even more challenging for compilers to recognize when these instructions can be used. Our system, Pitchfork, defines a portable fixed-point intermediate representation, FPIR, that captures common idioms in fixed-point code. FPIR can be used directly by programmers experienced with fixed-point, or Pitchfork can automatically lift from integer operations into FPIR using a term-rewriting system (TRS) composed of verified manual and automatically-synthesized rules. Pitchfork then lowers from FPIR into target-specific fixed-point instructions using a set of target-specific TRSs. We show that this approach improves runtime performance of portably-written fixed-point signal processing code in Halide, across a range of benchmarks, by geomean 1.31× on x86 with AVX2, 1.82× on ARM Neon, and 2.44× on Hexagon HVX compared to a standard LLVM-based compiler flow, while maintaining or improving existing compile times.
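To make the "common idioms" in the abstract above concrete, here is a minimal sketch of two fixed-point operations written as portable integer arithmetic: a saturating add and a rounding halving add on 8-bit unsigned values. The function names are hypothetical and are not FPIR operations or Pitchfork's API; Python's unbounded integers stand in for the wider intermediate type that C-like code would need.

```python
U8_MAX = 255  # largest value representable in an unsigned 8-bit lane


def wrapping_add_u8(a: int, b: int) -> int:
    """Plain uint8 addition: the result wraps around modulo 256."""
    return (a + b) & U8_MAX


def saturating_add_u8(a: int, b: int) -> int:
    """uint8 addition that clamps to 255 instead of wrapping."""
    return min(a + b, U8_MAX)


def rounding_halving_add_u8(a: int, b: int) -> int:
    """Average of two uint8 values, rounding ties upward.

    The sum a + b + 1 needs 9 bits, so portable code must widen first;
    this widening is what makes the idiom hard for a compiler to recognize.
    """
    return (a + b + 1) >> 1


if __name__ == "__main__":
    print(wrapping_add_u8(200, 100))         # 44
    print(saturating_add_u8(200, 100))       # 255
    print(rounding_halving_add_u8(1, 2))     # 2
```

Each function is a widen, compute, narrow sequence, which is the kind of fused combination of primitive operations that the abstract describes.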
Full Text Available
Sparsity-Specific Code Optimization using Expression Trees

https://doi.org/10.1145/3520484

Herholz, Philipp; Tang, Xuan; Schneider, Teseo; Kamil, Shoaib; Panozzo, Daniele; Sorkine-Hornung, Olga (October 2022, ACM Transactions on Graphics)

We introduce a code generator that converts unoptimized C++ code operating on sparse data into vectorized and parallel CPU or GPU kernels. Our approach unrolls the computation into a massive expression graph, performs redundant expression elimination, grouping, and then generates an architecture-specific kernel to solve the same problem, assuming that the sparsity pattern is fixed, which is a common scenario in many applications in computer graphics and scientific computing. We show that our approach scales to large problems and can achieve speedups of two orders of magnitude on CPUs and three orders of magnitude on GPUs, compared to a set of manually optimized CPU baselines. To demonstrate the practical applicability of our approach, we employ it to optimize popular algorithms with applications to physical simulation and interactive mesh deformation.
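The idea of specializing code to a fixed sparsity pattern can be illustrated with a small sketch. It assumes a sparse matrix-vector product as the computation and generates straight-line Python source for one specific pattern. The name `specialize_spmv` is hypothetical; this is not the authors' generator, and it omits the redundant expression elimination, grouping, vectorization, and parallelization described in the abstract.

```python
def specialize_spmv(n_rows, pattern):
    """Generate a straight-line y = A @ x kernel for a fixed sparsity pattern.

    pattern: list of (row, col) pairs; vals[k] holds the value at pattern[k].
    Returns the compiled kernel and its generated source text.
    """
    terms = {r: [] for r in range(n_rows)}
    for k, (r, c) in enumerate(pattern):
        # Indices are baked into the source, so the kernel has no loops
        # and no index lookups at run time.
        terms[r].append(f"vals[{k}] * x[{c}]")
    body = ",\n        ".join(
        " + ".join(terms[r]) or "0.0" for r in range(n_rows)
    )
    source = f"def kernel(vals, x):\n    return [\n        {body}\n    ]\n"
    namespace = {}
    exec(source, namespace)  # compile the generated source
    return namespace["kernel"], source


if __name__ == "__main__":
    # 3x3 matrix with nonzeros at (0,0), (0,2), and (2,1).
    kernel, source = specialize_spmv(3, [(0, 0), (0, 2), (2, 1)])
    print(source)
    print(kernel([2.0, 3.0, 4.0], [1.0, 10.0, 100.0]))  # [302.0, 0.0, 40.0]
```

The generated kernel stays valid for any values placed in `vals`, as long as the pattern itself does not change, which is the assumption the abstract relies on.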
Full Text Available
Searching for Fast Demosaicking Algorithms

https://doi.org/10.1145/3508461

Ma, Karima; Gharbi, Michael; Adams, Andrew; Kamil, Shoaib; Li, Tzu-Mao; Barnes, Connelly; Ragan-Kelley, Jonathan (October 2022, ACM Transactions on Graphics)

We present a method to automatically synthesize efficient, high-quality demosaicking algorithms, across a range of computational budgets, given a loss function and training data. It performs a multi-objective, discrete-continuous optimization which simultaneously solves for the program structure and parameters that best trade off computational cost and image quality. We design the method to exploit domain-specific structure for search efficiency. We apply it to several tasks, including demosaicking both Bayer and Fuji X-Trans color filter patterns, as well as joint demosaicking and super-resolution. In a few days on 8 GPUs, it produces a family of algorithms that significantly improves image quality relative to the prior state-of-the-art across a range of computational budgets from 10s to 1000s of operations per pixel (1 dB–3 dB higher quality at the same cost, or 8.5–200× higher throughput at the same or better quality). The resulting programs combine features of both classical and deep learning-based demosaicking algorithms into more efficient hybrid combinations, which are bandwidth-efficient and vectorizable by construction. Finally, our method automatically schedules and compiles all generated programs into optimized SIMD code for modern processors.
Full Text Available
A Cross-Platform Benchmark for Interval Computation Libraries

Tang, Xuan; Ferguson, Zachary; Schneider, Teseo; Zorin, Denis; Kamil, Shoaib; Panozzo, Daniele (January 2022, Parallel Processing and Applied Mathematics: 14th International Conference, PPAM 2022)

Interval computation is widely used in Computer Aided Design to certify computations that use floating-point operations, avoiding pitfalls related to rounding error introduced by inaccurate operations. Despite its popularity and practical benefits, support for interval arithmetic is neither standardized nor available in mainstream programming languages. We propose the first benchmark for interval computations, coupled with reference solutions computed with exact arithmetic, and compare popular C and C++ libraries over different architectures, operating systems, and compilers. The benchmark allows identifying limitations in existing implementations, and provides a reliable guide on which library to use on each system for different CAD applications. We believe that our benchmark will be useful for developers of future interval libraries, as a way to test the correctness and performance of their algorithms.
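As an illustration of what an interval library has to guarantee, the following minimal sketch implements interval addition and multiplication and checks the results against exact rational arithmetic. It assumes Python 3.9+ for `math.nextafter`, and it widens each bound by one floating-point step as a conservative stand-in for the directed rounding modes a real library would use. The `Interval` class is hypothetical and is not taken from any of the benchmarked libraries.

```python
import math
from dataclasses import dataclass
from fractions import Fraction


def _down(x: float) -> float:
    """Next representable float below x (outward rounding of a lower bound)."""
    return math.nextafter(x, -math.inf)


def _up(x: float) -> float:
    """Next representable float above x (outward rounding of an upper bound)."""
    return math.nextafter(x, math.inf)


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(_down(self.lo + other.lo), _up(self.hi + other.hi))

    def __mul__(self, other: "Interval") -> "Interval":
        # The extremes of a product lie among the four endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(_down(min(products)), _up(max(products)))

    def contains(self, value: Fraction) -> bool:
        """Exact containment test for finite bounds, using rationals."""
        return Fraction(self.lo) <= value <= Fraction(self.hi)


if __name__ == "__main__":
    a, b = Interval(0.1, 0.1), Interval(0.2, 0.2)
    exact_sum = Fraction(0.1) + Fraction(0.2)  # exact sum of the two floats
    print((a + b).contains(exact_sum))         # True
```

Comparing against `Fraction` values mirrors, on a very small scale, the abstract's use of reference solutions computed with exact arithmetic.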
Full Text Available
Compiling Graph Applications for GPUs with GraphIt

https://doi.org/10.1109/CGO51591.2021.9370321

Brahmakshatriya, Ajay; Zhang, Yunming; Hong, Changwan; Kamil, Shoaib; Shun, Julian; Amarasinghe, Saman (February 2021, 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO))
Full Text Available
Automatically translating image processing libraries to Halide

https://doi.org/10.1145/3355089.3356549

Ahmad, Maaz Bin; Ragan-Kelley, Jonathan; Cheung, Alvin; Kamil, Shoaib (November 2019, ACM Transactions on Graphics)

Full Text Available
EGGS: Sparsity-Specific Code Generation

Tang, Xuan; Schneider, Teseo; Kamil, Shoaib; Panda, Aurojit; Li, Jinyang; Panozzo, Daniele (January 2020, Computer graphics forum)

Full Text Available
Optimizing ordered graph algorithms with GraphIt

https://doi.org/10.1145/3368826.3377909

Zhang, Yunming; Brahmakshatriya, Ajay; Chen, Xinyi; Dhulipala, Laxman; Kamil, Shoaib; Amarasinghe, Saman; Shun, Julian (February 2020, Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO ’20))

Full Text Available
Optimizing Ordered Graph Algorithms with GraphIt

Zhang, Yunming; Brahmakshatriya, Ajay; Chen, Xinyi; Dhulipala, Laxman; Kamil, Shoaib; Amarasinghe, Saman; Shun, Julian (February 2020, International Symposium on Code Generation and Optimization)

Many graph problems can be solved using ordered parallel graph algorithms that achieve significant speedup over their unordered counterparts by reducing redundant work. This paper introduces a new priority-based extension to GraphIt, a domain-specific language for writing graph applications, to simplify writing high-performance parallel ordered graph algorithms. The extension enables vertices to be processed in a dynamic order while hiding low-level implementation details from the user. We extend the compiler with new program analyses, transformations, and code generation to produce fast implementations of ordered parallel graph algorithms. We also introduce bucket fusion, a new performance optimization that fuses together different rounds of ordered algorithms to reduce synchronization overhead, resulting in 1.2×–3× speedup over the fastest existing ordered algorithm implementations on road networks with large diameters. With the extension, GraphIt achieves up to 3× speedup on six ordered graph algorithms over state-of-the-art frameworks and hand-optimized implementations (Julienne, Galois, and GAPBS) that support ordered algorithms.
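For readers unfamiliar with ordered graph algorithms, the following sequential sketch shows the underlying idea of processing vertices by priority bucket, using single-source shortest paths with non-negative integer weights. It is plain Python, not GraphIt syntax; the function name and the `delta` bucket width are illustrative assumptions, and the sketch implements neither parallelism nor bucket fusion.

```python
from collections import defaultdict


def bucketed_sssp(graph, source, delta):
    """Shortest paths with vertices grouped into buckets of width `delta`.

    graph: dict mapping vertex -> list of (neighbor, weight) pairs,
           with non-negative integer weights.
    """
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    buckets = defaultdict(set)
    buckets[0].add(source)

    while buckets:
        i = min(buckets)           # always take the lowest-priority bucket
        frontier = buckets.pop(i)
        for u in frontier:
            if dist[u] // delta != i:
                continue           # stale entry: u moved to a lower bucket
            for v, w in graph[u]:
                nd = dist[u] + w
                if nd < dist[v]:
                    dist[v] = nd
                    # v may land back in bucket i, which re-creates it
                    # so it is processed again on the next round.
                    buckets[nd // delta].add(v)
    return dist


if __name__ == "__main__":
    g = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}
    print(bucketed_sssp(g, 0, delta=2))  # {0: 0, 1: 3, 2: 1, 3: 4}
```

Each pass over a bucket corresponds to one round of the ordered algorithm. In a parallel setting every round ends with a synchronization, which is the overhead the abstract says bucket fusion reduces.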
Full Text Available
ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism

https://doi.org/10.1109/SC.2018.00065

Cheshmi, Kazem; Kamil, Shoaib; Strout, Michelle Mills; Dehnavi, Maryam Mehri (November 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis)

Full Text Available
